Skip to content

Refactor crypto functions code#18664

Merged
Jefffrey merged 6 commits intoapache:mainfrom
Jefffrey:refactor-crypto
Nov 26, 2025
Merged

Refactor crypto functions code#18664
Jefffrey merged 6 commits intoapache:mainfrom
Jefffrey:refactor-crypto

Conversation

@Jefffrey
Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey commented Nov 13, 2025

Which issue does this PR close?

N/A

Rationale for this change

Deduplicate & simplify code in the crypto functions.

What changes are included in this PR?

Fold Sha224/Sha256/Sha384/Sha512 into a common struct.

Cleanup signature & return types.

Simplify code in datafusion/functions/src/crypto/basic.rs

Are these changes tested?

Existing tests.

Are there any user-facing changes?

Some public methods were removed, though I don't believe they were intended to be used outside of other DataFusion crates.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Nov 13, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 13, 2025
"computes blake3 hash digest of the given input"
);

macro_rules! digest_to_scalar {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved down, closer to where it's actually used

/// Digest computes a binary hash of the given data, accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`].
/// Second argument is the algorithm to use.
/// Standard algorithms are md5, sha1, sha224, sha256, sha384 and sha512.
pub fn digest(args: &[ColumnarValue]) -> Result<ColumnarValue> {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to digest.rs as that is the only place it is used

}
s
}
pub fn utf8_or_binary_to_binary_type(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored away, was only used for return types in sha/md5/digest (which are now simplified)

Comment on lines -254 to -257
/// digest a binary array to their hash values
pub fn digest_binary_array<T>(self, value: &dyn Array) -> Result<ColumnarValue>
where
T: OffsetSizeTrait,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially inlined these, as didn't see much benefit and they were adding more indirection

@Jefffrey Jefffrey marked this pull request as ready for review November 13, 2025 07:29
}
})
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
Ok(DataType::Utf8View)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why md5 returns Utf8View ?
digest and sha return Binary

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like md5 will naturally hex the output, thus it returns a string whereas the other digest functions return the binary output as is

};
use arrow::array::{AsArray, GenericStringArray, StringViewArray};
use arrow::array::{Array, ArrayRef, BinaryArray, BinaryArrayType};
use arrow::array::{AsArray, StringViewArray};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The two lines above could be merged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the imports (wish rustfmt could handle this for us 😔 )


/// Digest computes a binary hash of the given data, accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`].
/// Second argument is the algorithm to use.
/// Standard algorithms are md5, sha1, sha224, sha256, sha384 and sha512.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Standard algorithms are md5, sha1, sha224, sha256, sha384 and sha512.
/// Standard algorithms are md5, sha224, sha256, sha384 and sha512.

sha1 is not supported in sha.rs and since it is too weak it should not be advertised

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the docstring; we didn't even support sha1 for this method anyway so seems it was outdated.

@alamb alamb added the api change Changes the API exposed to users of the crate label Nov 25, 2025
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Jefffrey (and @martin-g ) -- I think this code looks much better now 👌

ScalarValue::Binary($INPUT.as_ref().map(|v| {
let mut digest = $METHOD::default();
digest.update(v);
#[allow(deprecated)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried removing this #[allow(deprecated)] annotation locally and clippy still passed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[allow(deprecated)]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking, fixed; I think last I tested one of the sha libs was causing a deprecation warning here, so I guess it's been updated since then 🎉

+------------------------------------------+
| <binary_hash_result> |
+------------------------------------------+
+------------------------------------------------------------------+
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked this is indeed the output

DataFusion CLI v51.0.0
> select digest('foo', 'sha256');
+------------------------------------------------------------------+
| digest(Utf8("foo"),Utf8("sha256"))                               |
+------------------------------------------------------------------+
| 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae |
+------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.030 seconds.

pub mod sha256;
pub mod sha384;
pub mod sha512;
pub mod sha;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think technically this is a breaking API change, but not likely to cause major problems. I tagged this PR as an API change so it shows up in the release notes

@Jefffrey Jefffrey added this pull request to the merge queue Nov 26, 2025
Merged via the queue into apache:main with commit 4eb2933 Nov 26, 2025
28 checks passed
@Jefffrey Jefffrey deleted the refactor-crypto branch November 26, 2025 08:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate documentation Improvements or additions to documentation functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants